Explore and Summarize data (red wine quality) by Nada Alsaab

Introduction

This project uses the Red Wine Quality dataset to explore the relationships among the wine features, with a focus on how they relate to quality.

Start_data_exploratory

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"             
## [13] "quality2"
## 'data.frame':    1599 obs. of  13 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
##  $ quality2            : num  5 5 5 6 5 5 5 7 7 5 ...

This dataset has 1599 observations and 13 variables. Quality is stored in two columns: quality is an ordered factor and quality2 is numeric.

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol      quality
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40   3: 10  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   4: 53  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20   5:681  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42   6:638  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   7:199  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90   8: 18  
##     quality2    
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Median quality is 6, while mean quality is about 5.64.
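As a quick check, the median and mean of quality can be recomputed from the quality counts shown in the summary output. This is only a minimal sketch: quality_scores is an illustrative name, and the vector is rebuilt from the counts instead of being read from the data file.

```r
# Rebuild the numeric quality scores from the level counts in the summary output
# (quality_scores is an illustrative name, not a column of wine_data)
quality_levels <- 3:8
quality_counts <- c(10, 53, 681, 638, 199, 18)
quality_scores <- rep(quality_levels, times = quality_counts)

length(quality_scores)   # number of observations
median(quality_scores)   # middle value of the sorted scores
mean(quality_scores)     # arithmetic average of the scores
```

The results should agree with the quality2 column of the summary (median 6.000, mean 5.636).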

Univariate Plots Section

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Fixed acidity is approximately normally distributed. Values range from 4.6 to 15.9, and the most common values fall between 7 and 8.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Volatile acidity is approximately normally distributed. Values range from 0.12 to 1.58, and the most common values fall between 0.4 and 0.7.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Citric acid has a right-skewed distribution. Values range from 0 to 1.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Residual sugar has a right-skewed distribution. Values range from 0.9 to 15.5, and the most common values fall between 1.5 and 3.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The bulk of the chlorides distribution looks approximately normal, with a long right tail. Values range from 0.012 to 0.611, and the most common values fall between 0.05 and 0.1.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Free sulfur dioxide has a right-skewed distribution. Values range from 1 to 72, and the most common values fall between 1 and 10.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Total sulfur dioxide has a right-skewed distribution. Values range from 6 to 289, and the most common values fall between 10 and 50.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Density is approximately normally distributed. Values range from 0.9901 to 1.0037, and the most common values fall between 0.995 and 1.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

pH is approximately normally distributed. Values range from 2.74 to 4.01, and the most common values fall between 3.25 and 3.5.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Sulphates are approximately normally distributed. Values range from 0.33 to 2, and the most common values fall between 0.5 and 0.75.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Alcohol has a right-skewed distribution. Values range from 8.4 to 14.9, and the most common values fall between 9 and 10.

Quality is categorical. It ranges from 3 to 8, and the most frequent value is 5.

I will recategorize quality as a rank: low (quality < 5), good (quality from 5 to 6), and very good (quality > 6).

# Recategorize the quality as rank (low, good, very good)
wine_data$rank <- ifelse(wine_data$quality < 5, 'low', ifelse(
  wine_data$quality < 7, 'good', 'very good'))
wine_data$rank <- ordered(wine_data$rank, levels = c('low', 'good', 'very good'))
summary(wine_data$rank)
##       low      good very good 
##        63      1319       217
ggplot(data = wine_data, aes(x=rank, fill=rank)) +
  geom_bar() + theme_minimal() +
  scale_fill_brewer(type = 'seq', palette = 4)
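An alternative sketch of the same recategorization uses cut() on numeric quality scores instead of nested ifelse() calls. The scores below are rebuilt from the quality counts in the summary output, and quality_scores and rank_cut are illustrative names, not objects from the analysis above.

```r
# Numeric quality scores rebuilt from the level counts in the summary output
quality_scores <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Intervals are right-closed: (-Inf, 4] is low, (4, 6] is good, (6, Inf] is very good
rank_cut <- cut(quality_scores,
                breaks = c(-Inf, 4, 6, Inf),
                labels = c("low", "good", "very good"),
                ordered_result = TRUE)

table(rank_cut)
```

The counts should match the summary(wine_data$rank) output above (63, 1319, and 217).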

Univariate Analysis

What is the structure of your dataset?

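This dataset has 1599 observations and 13 variables: 11 chemical properties plus quality, which is stored twice, as an ordered factor (quality) and as a number (quality2).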

What is/are the main feature(s) of interest in your dataset?

Quality is the main feature of interest.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I think alcohol, pH, volatile acidity, and total sulfur dioxide will help.

Did you create any new variables from existing variables in the dataset?

Yes, I created a rank variable that recategorizes quality as low, good, or very good.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

The dataset was already tidy, so I used it as is without any cleaning or transformation.

—————————————————————————–

Bivariate Plots Section

## Warning in ggscatmat(wine_data, columns = 1:13): Factor variables are
## omitted in plot

The higher the absolute value of the correlation coefficient, the stronger the relationship between the two variables. From the matrix, we can see that the highest absolute value is between pH and fixed.acidity, with a correlation of -0.68. The next strongest relationships are between density and fixed.acidity and between citric.acid and fixed.acidity, both with a correlation of 0.67.
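The search for the strongest pair can be sketched in base R with cor(). This is only an illustration under stated assumptions: wine_head is a hypothetical data frame holding the first 10 rows of three columns copied from the str() output, so its coefficients will differ from the full-data values quoted in this section.

```r
# First 10 rows of three columns, copied from the str() output
# (wine_head is an illustrative name; the analysis itself uses all 1599 rows)
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Pearson correlation matrix of all column pairs
cor_matrix <- cor(wine_head)

# Keep each pair once by blanking the diagonal and the lower triangle
cor_upper <- cor_matrix
cor_upper[lower.tri(cor_upper, diag = TRUE)] <- NA

# Locate the pair with the largest absolute correlation
strongest <- which(abs(cor_upper) == max(abs(cor_upper), na.rm = TRUE),
                   arr.ind = TRUE)
strongest_pair  <- c(rownames(cor_upper)[strongest[1, "row"]],
                     colnames(cor_upper)[strongest[1, "col"]])
strongest_value <- cor_upper[strongest[1, "row"], strongest[1, "col"]]
```

Comparing absolute values matters here because a strong negative correlation would otherwise be ranked below a weak positive one.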

pH vs fixed.acidity

# Scatter plot with a fitted linear regression line
ggplot(aes(x = pH, y = fixed.acidity), data = wine_data) +
  geom_point(position = position_jitter(h = 0), color = "purple") +
  stat_smooth(method = 'lm') +
  labs(title = "pH VS Fixed acidity",
       x = "pH", y = "Fixed acidity")

From the figure above, we can see a strong negative relationship between pH and fixed.acidity.
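The line drawn by stat_smooth(method = 'lm') is an ordinary linear regression, so its direction can also be checked numerically. The sketch below assumes a hypothetical data frame wine_head with only the first 10 rows from the str() output, so the numbers are illustrative and will not equal the full-data results.

```r
# First 10 rows of pH and fixed.acidity, copied from the str() output
# (wine_head is an illustrative name, not the full wine_data)
wine_head <- data.frame(
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
)

# Fit fixed.acidity as a linear function of pH
fit <- lm(fixed.acidity ~ pH, data = wine_head)
slope <- coef(fit)[["pH"]]

# Pearson correlation of the same two columns
r <- cor(wine_head$pH, wine_head$fixed.acidity)

# In a simple linear regression the slope and the correlation share the same sign
sign(slope) == sign(r)
```

A negative slope means that fixed acidity tends to fall as pH rises, which is the pattern described for the plot.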

density vs fixed.acidity

# Scatter plot with a fitted linear regression line
ggplot(aes(x = density, y = fixed.acidity), data = wine_data) +
  geom_point(position = position_jitter(h = 0), color = "purple") +
  stat_smooth(method = 'lm') +
  labs(title = "Density VS Fixed acidity",
       x = "Density", y = "Fixed acidity")

From the figure above, we can see a strong positive relationship between density and fixed.acidity.

citric.acid vs fixed.acidity

# Scatter plot with a fitted linear regression line
ggplot(aes(x = citric.acid, y = fixed.acidity), data = wine_data) +
  geom_point(position = position_jitter(h = 0), color = "purple") +
  stat_smooth(method = 'lm') +
  labs(title = "Citric acid VS Fixed acidity",
       x = "Citric acid", y = "Fixed acidity")

From the figure above, we can see a strong positive relationship between citric.acid and fixed.acidity.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

After the analysis, the two variables with the strongest relationship are pH and fixed.acidity, with a correlation of -0.68.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

From the matrix, I noticed that the variable most strongly related to quality is “alcohol”, with a correlation coefficient of 0.48.

What was the strongest relationship you found?

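The strongest relationship overall is between pH and fixed.acidity, with a correlation of -0.68. The strongest positive relationships are between fixed.acidity and citric.acid and between fixed.acidity and density, both with a correlation of 0.67.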

—————————————————————————–

Multivariate Plots Section

# Box plot with jittered points colored by rank
ggplot(aes(x = quality, y = alcohol), data = wine_data) +
  geom_point(aes(color = rank, fill = rank), position = position_jitter(h = 0)) +
  geom_boxplot(alpha = 0.5) +
  scale_colour_brewer(palette = 3) +
  labs(title = "Quality VS Alcohol",
       x = "Quality", y = "Alcohol")

From the figure above, we can see that most of the observations fall in quality levels 5 to 7, and that there is a positive relationship between quality and alcohol.
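The center line of each box is the group median, which can also be computed directly with tapply(). This is a small sketch on the first 10 rows shown in the str() output; alcohol_head and quality_head are illustrative names, and with so few rows the medians only hint at the pattern seen in the full data.

```r
# First 10 values of alcohol and quality2, copied from the str() output
# (alcohol_head and quality_head are illustrative names)
alcohol_head <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality_head <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Median alcohol for each quality level present in the sample
alcohol_medians <- tapply(alcohol_head, quality_head, median)
alcohol_medians
```

On the full dataset the same call could use wine_data$alcohol and wine_data$quality, which would return one median per quality level from 3 to 8.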

quality vs volatile.acidity

# Box plot with jittered points colored by rank
ggplot(aes(x = quality, y = volatile.acidity), data = wine_data) +
  geom_point(aes(color = rank, fill = rank), position = position_jitter(h = 0)) +
  geom_boxplot(alpha = 0.5) +
  scale_colour_brewer(palette = 3) +
  labs(title = "Quality VS Volatile acidity",
       x = "Quality", y = "Volatile acidity")

From the figure above, we can see that most of the observations fall in quality levels 5 to 7, and that there is a negative relationship between quality and volatile acidity.

# Scatter plot colored by rank, with a fitted linear regression line
ggplot(aes(x = pH, y = fixed.acidity), data = wine_data) +
  geom_point(aes(color = rank, fill = rank), position = position_jitter(h = 0)) +
  stat_smooth(method = 'lm') +
  labs(title = "pH VS Fixed acidity",
       x = "pH", y = "Fixed acidity")

From the figure above, we can see a strong negative relationship between pH and fixed acidity, and most wines are ranked “good”.

# Scatter plot colored by rank, with a fitted linear regression line
ggplot(aes(x = density, y = fixed.acidity), data = wine_data) +
  geom_point(aes(color = rank, fill = rank), position = position_jitter(h = 0)) +
  stat_smooth(method = 'lm') +
  labs(title = "Density VS Fixed acidity",
       x = "Density", y = "Fixed acidity")

From the figure above, we can see a strong positive relationship between density and fixed acidity. The “good” rank makes up most of the points, so the relationship is most visible for that rank.

citric.acid vs fixed.acidity

# Scatter plot colored by rank, with a fitted linear regression line
ggplot(aes(x = citric.acid, y = fixed.acidity), data = wine_data) +
  geom_point(aes(color = rank, fill = rank), position = position_jitter(h = 0)) +
  stat_smooth(method = 'lm') +
  labs(title = "Citric acid VS Fixed acidity",
       x = "Citric acid", y = "Fixed acidity")

From the figure above, we can see a strong positive relationship between citric.acid and fixed acidity.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Were there any interesting or surprising interactions between features?

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.


Final Plots and Summary

Plot One

# Recategorize the quality as rank (low, good, very good)
wine_data$rank <- ifelse(wine_data$quality < 5, 'low', ifelse(
  wine_data$quality < 7, 'good', 'very good'))
wine_data$rank <- ordered(wine_data$rank, levels = c('low', 'good', 'very good'))
summary(wine_data$rank)
##       low      good very good 
##        63      1319       217
ggplot(data = wine_data, aes(x=rank, fill=rank)) +
  geom_bar() + theme_minimal() +
  scale_fill_brewer(type = 'seq', palette = 4)

Description One

Since quality has 6 numeric categories, I decided to recategorize it into three ranks (low, good, very good) to make it clearer. This figure shows the number of wines in each rank.

Plot Two

# Scatter plot with a fitted linear regression line
ggplot(aes(x = pH, y = fixed.acidity), data = wine_data) +
  geom_point(position = position_jitter(h = 0), color = "purple") +
  stat_smooth(method = 'lm')

Description Two

The correlation coefficient between pH and fixed acidity is negative, which indicates an inverse relationship. It also has the highest absolute value of all pairs, which indicates that this relationship is the strongest.

Plot Three

# Scatter plot colored by rank, with a fitted linear regression line
ggplot(aes(x = pH, y = fixed.acidity), data = wine_data) +
  geom_point(aes(color = rank, fill = rank), position = position_jitter(h = 0)) +
  stat_smooth(method = 'lm') +
  labs(title = "pH VS Fixed acidity",
       x = "pH", y = "Fixed acidity")

Description Three

The correlation coefficient between pH and fixed acidity is negative, which indicates an inverse relationship. It also has the highest absolute value of all pairs, which indicates that this relationship is the strongest. As the figure above shows, most red wines are ranked “good”.

Reflection

First, I chose the red wine dataset, which describes 1599 wines by 11 chemical properties and a quality score. I started by exploring and visualizing every variable to understand the data. Then I tried to answer the questions using the analysis, the visualizations, and the correlation matrix. I looked for the relationships between variables: which is the strongest, which are positive, and which are negative. I noticed that the strongest relationship is between pH and fixed acidity, and that it is negative. I also explored which variables are most related to wine quality: the most influential is alcohol, followed by volatile acidity. I enjoyed this analysis, and I think that with a larger dataset we could have more findings.